We will make extensive use of Python and various associated libraries, so the first thing we need to ensure is that we all have a common setup and are using the same software. The Python distribution that we have decided to use is Anaconda, which can be downloaded from here (although we hope that you have already done this prior to the school). Make sure that you installed the Python 2.7 version for your operating system (there is nothing wrong with Python 3.x, but it is slightly different syntactically). Python 2 will be supported until 2020. We recommend writing code that supports Python 2.7 and 3.x simultaneously; you can use the future package explained here.
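For example, a minimal sketch of 2/3-compatible code using the built-in __future__ module (the future package provides further compatibility helpers):
from __future__ import division, print_function
# With these imports, Python 2.7 adopts the Python 3 behaviour:
print(7 / 2)  # prints 3.5 under both Python 2.7 and 3.x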
For installing Anaconda in Linux follow this link
For installing Anaconda in Mac follow this link
For installing Anaconda in Windows follow this link
We will follow the LSST Data Management style guide explained here
One of the advantages of the Anaconda distribution is that it comes with many of the most commonly-used Python packages, such as numpy, scipy, and scikit-learn, preinstalled. However, if you do need to install a new package then it is very straightforward: you can either use the Anaconda installation tool conda or the generic Python tool pip (the two use different registries of available packages, and sometimes a particular package will not be available via one tool but will be via the other).
For example, irlbpy is a package providing a fast algorithm for finding the largest eigenvalues (and corresponding eigenvectors) of very large matrices. We can try to install it first with conda:
conda install irlbpy
but this will not find it:
Fetching package metadata: ....
Error: No packages found in current osx-64 channels matching: irlbpy
You can search for this package on Binstar with
binstar search -t conda irlbpy
so instead we try with pip:
pip install irlbpy
In the event that both fail, you can always just download the package source code and then install it manually with:
python setup.py install
in the appropriate source directory.
We'll now take a brief look at a few of the main Python packages.
The standard way to use the Python programming language is to use the Python interpreter to run Python code. The Python interpreter is a program that reads and executes the Python code in files passed to it as arguments. At the command prompt, the command python is used to invoke the Python interpreter.
For example, to run a file my-program.py that contains Python code from the command prompt, use:
$ python my-program.py
We can also start the interpreter by simply typing python at the command line, and then interactively type Python code into it.
This is often how we want to work when developing scientific applications or when doing small calculations, but the standard Python interpreter is not very convenient for this kind of work due to a number of limitations.
IPython is an interactive shell that addresses the limitations of the standard Python interpreter, and it is a work-horse for the scientific use of Python. It provides an interactive prompt to the Python interpreter with greatly improved user-friendliness.
Some of the many useful features of IPython include:
- Command history, which can be browsed with the up and down arrows on the keyboard.
- Tab auto-completion.
- In-line editing of code.
- Object introspection, and automatic extraction of documentation strings from Python objects like classes and functions.
- Good interaction with the operating system shell.
- Support for multiple parallel back-end processes that can run on computing clusters or cloud services like Amazon EC2.
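For example, in an IPython session you can append ? to any object to display its documentation (object introspection), and prefix a line with ! to pass it to the operating system shell (a minimal illustration, using IPython's own prompt style):
In [1]: import numpy as np
In [2]: np.arange?
In [3]: !ls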
Jupyter notebook is an HTML-based notebook environment for Python, similar to Mathematica or Maple. It is based on the IPython shell, but provides a cell-based environment with great interactivity, where calculations can be organized and documented in a structured way.
Although they use a web browser as the graphical interface, Jupyter notebooks are usually run locally, on the same computer that runs the browser. To start a new Jupyter notebook session, run the following command:
$ jupyter notebook
from a directory where you want the notebooks to be stored. This will open a new browser window (or a new tab in an existing window) with an index page where existing notebooks are shown and from which new notebooks can be created. Usually, the URL for the Jupyter notebook is http://localhost:8888
NumPy is the main Python package for working with N-dimensional arrays. Any list of numbers can be recast as a NumPy array:
In [58]:
import numpy as np
x = np.array([1, 5, 3, 4, 2])
x
Out[58]:
Arrays have a number of useful methods associated with them:
In [59]:
from __future__ import print_function  # so that print() behaves the same in Python 2.7 and 3.x
print(x.min(), x.max(), x.sum(), x.argmin(), x.argmax())
and NumPy functions can act on arrays in an elementwise fashion:
In [60]:
np.sin(x * np.pi / 180.)
Out[60]:
Ranges of values are easily produced:
In [61]:
np.arange(1, 10, 0.5)
Out[61]:
In [62]:
np.linspace(1, 10, 5)
Out[62]:
In [63]:
np.logspace(1, 3, 5)
Out[63]:
Random numbers are also easily generated in the half-open interval [0, 1):
In [64]:
help(np.random.random)
In [65]:
np.random.random(10)
Out[65]:
or from one of the large number of statistical distributions provided:
In [66]:
np.random.normal(loc = 2.5, scale = 5, size = 10)
Out[66]:
Another useful method is the where function for identifying elements that satisfy a particular condition:
In [67]:
x = np.random.normal(size = 100)
np.where(x > 3.)
Out[67]:
Of course, all of these work equally well with multidimensional arrays.
In [68]:
x = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
np.sin(x)
Out[68]:
Data can also be loaded directly from a file into a NumPy array via the loadtxt or genfromtxt functions:
In [69]:
data = np.loadtxt("data/sample_data.csv", delimiter = ",", skiprows = 3)
data
Out[69]:
Matplotlib is Python's most popular and comprehensive plotting library that is especially useful in combination with NumPy/SciPy.
In [70]:
import matplotlib.pyplot as plt
%matplotlib inline
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = x**2
plt.plot(x, y)
plt.xlabel('X-axis title')
plt.ylabel('Y-axis title')
Out[70]:
In [71]:
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
Out[71]:
In [72]:
help(plt.hist)
In [73]:
x = 10 + 5 * np.random.randn(10000)
n, bins, patches = plt.hist(x, bins=30)
AstroPy aims to provide a core set of subpackages to specifically support astronomy. These include methods to work with image and table data formats, e.g., FITS, VOTable, etc., along with astronomical coordinate and unit systems, and cosmological calculations.
In [74]:
from astropy import units as u
from astropy.coordinates import SkyCoord
c = SkyCoord(ra = 10.625 * u.degree, dec = 41.2 * u.degree, frame = 'icrs')
print(c.to_string('hmsdms'))
print(c.galactic)
In [75]:
from astropy.cosmology import WMAP9 as cosmo
print(cosmo.comoving_distance(1.25), cosmo.luminosity_distance(1.25))
You can read FITS images and VOTables using AstroPy:
In [76]:
from astropy.io import fits
hdulist = fits.open('data/sample_image.fits')
hdulist.info()
In [77]:
data = hdulist[0].data
data
Out[77]:
In [78]:
from astropy.io.votable import parse
votable = parse('data/sample_votable.xml')
table = votable.get_first_table()
In [79]:
fields = table.fields
data = table.array
print(fields)
print(data)
A useful affiliated package is Astroquery, which provides tools for querying astronomical web forms and databases. It is not part of the regular AstroPy distribution and needs to be installed separately. Whereas many data archives have standardized VO interfaces to support data access, Astroquery mimics a web browser and provides access via an archive's form interface. This can be useful as not all of the available information is necessarily accessible via the VO.
For example, the NASA Extragalactic Database (NED) is a very useful human-curated resource for extragalactic objects. However, a lot of the information that is available via its web pages is not exposed through an easy programmatic API. Let's say that we want to get the list of object types associated with a particular source:
In [80]:
from astroquery.ned import Ned
coo = SkyCoord(ra = 56.38, dec = 38.43, unit = (u.deg, u.deg))
result = Ned.query_region(coo, radius = 0.07 * u.deg)
set(result.columns['Type'])
Out[80]:
SciPy provides a number of subpackages that deal with common operations in scientific computing, such as numerical integration, optimization, interpolation, Fourier transforms and linear algebra.
In [81]:
f = lambda x: np.cos(-x ** 2 / 9.)
x = np.linspace(0, 10, 11)
y = f(x)
from scipy.interpolate import interp1d
f1 = interp1d(x, y)
f2 = interp1d(x, y, kind = 'cubic')
from scipy.integrate import quad
print(quad(f1, 0, 10))
print(quad(f2, 0, 10))
print(quad(f, 0, 10))
scikit-learn provides algorithms for machine learning tasks, such as classification, regression, and clustering, as well as associated operations, such as cross-validation and feature normalization. A related module is astroML, which wraps many of the scikit-learn routines but also offers additional functionality and faster or alternative implementations of some methods.
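As a minimal sketch of a typical scikit-learn workflow (the data here are random and purely illustrative, and we assume a recent scikit-learn where cross_val_score lives in sklearn.model_selection):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.random((100, 5))    # 100 samples, each with 5 features
y = np.random.randint(0, 2, 100)  # binary class labels
clf = RandomForestClassifier(n_estimators=10)
print(cross_val_score(clf, X, y, cv=5))  # classification accuracy on each of 5 folds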
pandas offers data structures, particularly data frames, and operations for manipulating numerical tables and time series, such as fancy indexing, reshaping and pivoting, and merging, as well as a number of analysis tools. Although similar functionality already exists in numpy, pandas is highly optimized for performance and large data sets.
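A minimal sketch of a pandas data frame in action (the column names and values are invented for illustration):
import pandas as pd

df = pd.DataFrame({'band': ['g', 'r', 'g', 'r'],
                   'mag': [21.2, 20.8, 19.5, 19.1]})
print(df[df['mag'] < 21.0])              # boolean ("fancy") indexing
print(df.groupby('band')['mag'].mean())  # mean magnitude per band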
For some of the other lectures or projects this week, you might also need to install the following Python packages: